2025-01-10
Many important outcomes in the social sciences are non-continuous
Binary or categorical outcomes: vote choice, choosing to ride a bike instead of driving, or winning an election
Counts: number of trade agreements, number of gun deaths in a city, or number of doctor visits in a year
Durations: length of a military conflict, the longevity of a coalition, or the time spent on unemployment
Fitting a line to these kinds of outcomes often produces nonsensical results: predictions below the minimum or above the maximum possible value of the outcome, and slopes with impossible increments. Moreover, the functional form is likely incorrect even where the predictions are possible, because the true slope depends on the current expected value of the dependent variable.
And the error term will be non-normal and heteroskedastic by definition, so the usual OLS standard errors are invalid.
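To see this concretely, here is a minimal sketch with simulated data (not the NAVCO data): regressing a 0/1 outcome on a continuous predictor with `lm()` yields fitted "probabilities" outside the unit interval.

```r
# Simulated illustration: OLS on a binary outcome gives out-of-bounds predictions
set.seed(1)
x <- rnorm(500, sd = 2)                    # a continuous predictor
y <- as.numeric(x + rlogis(500) > 0)       # a binary outcome
lpm <- lm(y ~ x)                           # the linear probability model
range(fitted(lpm))                         # extends below 0 and above 1
```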
The NAVCO project collects data on “maximalist campaigns”, which are defined as sustained efforts to remove a government, attain territorial independence, or make fundamental changes to an existing constitutional order. A key question motivating the project is “what sorts of tactics are effective at creating major political change?”
For models with a single categorical predictor and a limited number of outcome categories, contingency table methods can give you a valid test of statistical significance
The \(\chi^2\) and Fisher’s Exact Test will work for categorical by categorical relationships.
Spearman’s \(\rho\), Kendall’s \(\tau\), and Somers’ \(D\) can test ordered relationships.
\(P<.05\) in Fisher’s Exact test means a less than 5% chance of getting these cell frequencies if the rows and columns are uncorrelated.
library(gtsummary)
navco |>
  select(outcome, goal_type) |>
  tbl_summary(by = goal_type) |>
  # add_p(test = outcome ~ 'chisq.test') # for x^2 test
  add_p(test = outcome ~ 'fisher.test')

| Characteristic | Other, N = 8¹ | Regime Change, N = 248¹ | Secession, N = 48¹ | Self determination, N = 48¹ | p-value² |
|---|---|---|---|---|---|
| outcome | | | | | <0.001 |
| failure | 5 (63%) | 131 (53%) | 43 (90%) | 23 (48%) | |
| success | 3 (38%) | 117 (47%) | 5 (10%) | 25 (52%) | |

¹ n (%)
² Fisher’s exact test
Methods like Fisher’s Exact test are non-parametric and will be more conservative at smaller sample sizes compared to regression
Only valid for categorical-by-categorical relationships
Assuming the DV is coded as 0 or 1:
The constant is the proportion of successes for the omitted category
Each coefficient is the difference in the proportion of success compared to the constant.
Since these are all mutually exclusive categories, none of the predictions will be above 1 or below 0.
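A quick sketch of those two facts with simulated (not NAVCO) data: with a single binary predictor, the LPM reproduces the group proportions exactly.

```r
# With one binary predictor, the LPM intercept is the success proportion in
# the omitted group, and the slope is the difference in proportions
set.seed(2)
g <- rbinom(1000, 1, .5)                       # a binary predictor
y <- rbinom(1000, 1, ifelse(g == 1, .6, .4))   # a binary outcome
fit <- lm(y ~ g)
coef(fit)[1] - mean(y[g == 0])                 # ~0: intercept = baseline proportion
coef(fit)[2] - (mean(y[g == 1]) - mean(y[g == 0]))  # ~0: slope = difference
```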
Good: If there is a single binary predictor, then a linear probability model is probably fine.
Bad: Violates the Gauss-Markov assumptions
Bad: Can generate nonsensical predictions (probabilities greater than 1 or less than 0)
Many machine learning methods can handle all sorts of weird data structures and many of them perform better than logit/probit models when predicting binary outcomes.
However! Machine learning means optimizing for prediction rather than hypothesis testing.
(technically, logistic regression is itself a machine learning method. The distinction here is mostly a question of research goals)
Tree-building algorithms recursively split the data at different levels of the predictors to produce a decision tree. The model is non-parametric and can accommodate all sorts of complex non-linear and interactive relationships among the predictors.
Most algorithms will perform automatic variable selection as part of the fitting process, and a variety of modifications exist to reduce these models’ tendency to overfit.
(ideally in this case you would want to create a held-out set and compare the predictive quality of different models)
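As a hypothetical illustration of recursive partitioning (the rpart package on simulated data, not the NAVCO data), a tree can recover an interactive, non-linear rule that a linear model would miss:

```r
library(rpart)
set.seed(3)
d <- data.frame(x1 = runif(500), x2 = runif(500))
d$y <- factor(as.numeric(d$x1 > .5 & d$x2 > .5))  # outcome depends on an interaction
tree <- rpart(y ~ x1 + x2, data = d, method = "class")
# the fitted splits land near x1 = .5 and x2 = .5
mean(predict(tree, type = "class") == d$y)        # high in-sample accuracy
```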
If you pick up a textbook on Machine Learning, you’ll probably find some discussion of logistic regression, but the emphasis will generally be on prediction rather than hypothesis testing
Alternative classification methods like trees, support vector machines, bagging, neural networks etc. can often handle non-linearity, overfitting, interaction effects etc. more easily than models like logit/probit, but this flexibility can also make them “black boxes” that are harder to interpret.
(The following sections will deal with models for binary outcomes, but the general intuition applies to other kinds of discrete choices)
One way to think about a binary outcome is as the result of some continuous latent variable that depends on the predictors:
\[Y^*_i = \beta X_i + \epsilon_i \] That generates a \(1\) if \(Y^*_i > 0\) and \(0\) otherwise.
In a logit, we assume \(\epsilon_i\) follows a logistic distribution. In a probit, we assume it follows a standard normal (Gaussian) distribution. Understood in this sense, that is really the only difference.
Changing the “intercept” term here will shift the curve to the left or the right
Changing values of the slope will make the line steeper (especially around the middle)
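A quick numerical sketch of both claims using base R’s `plogis()` (the logistic CDF):

```r
x  <- seq(-4, 4, by = .1)
p0 <- plogis(0 + 1 * x)   # baseline curve
p1 <- plogis(1 + 1 * x)   # bigger intercept: the whole curve shifts left
p2 <- plogis(0 + 2 * x)   # bigger slope: steeper, especially around the middle
all(p1 > p0)              # shifted curve lies everywhere above the baseline
(p2[42] - p2[40]) > (p0[42] - p0[40])  # steeper near x = 0 (x[41] is 0)
```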
OLS minimizes the sum of squared residuals, but we can’t compute residuals for an unobserved latent variable
Instead, maximum likelihood methods attempt to identify the function that maximizes the likelihood of the observed data.
There’s no “closed form solution”, so MLE methods use optimization algorithms to fit models
\[ \ln L(\beta_0,\beta_1) = \sum_{i=1}^N \left\{ y_i \ln p(x_i; \beta_0,\beta_1) + (1-y_i) \ln [1-p(x_i;\beta_0,\beta_1)] \right\} \]
(technically, most optimization algorithms minimize the negative of this function)
set.seed(400)
N <- 5000
alpha <- 0
beta <- 1
d <- tibble(
  X = rnorm(N), # random IVs
  y_star = alpha + X * beta, # a latent variable
  Y = as.numeric((y_star + rlogis(N)) > 0) # 1s and 0s (this is what we can observe!)
)
# the negative log likelihood function for the logistic model:
neg_log_likelihood <- function(beta, x, y){
  theta <- beta * x
  p <- exp(theta) / (1 + exp(theta))
  val <- -sum(y * log(p) + (1 - y) * log(1 - p))
  return(val)
}

Note that the negative log likelihood is minimized at the correct value of \(\beta_1\):
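To sketch that point, we can hand the negative log likelihood to a one-dimensional optimizer; the data are re-simulated here so the snippet is self-contained, and `optimize()` is base R:

```r
set.seed(400)
N <- 5000
X <- rnorm(N)
Y <- as.numeric((X + rlogis(N)) > 0)      # true slope is 1, intercept is 0
nll <- function(beta, x, y) {
  p <- plogis(beta * x)                   # plogis() is a stabler inverse logit
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
beta_hat <- optimize(nll, c(-5, 5), x = X, y = Y)$minimum
beta_hat                                  # close to the true value of 1
```

Essentially the same estimate comes out of `glm(Y ~ 0 + X, family = "binomial")`, which optimizes the likelihood via iteratively reweighted least squares.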
The only thing you really need to know here, though, is that this is the quantity we optimize instead of minimizing the RSS. The maximum has no closed-form solution, so it has to be approximated using an iterative process. This means estimation can fail to converge in some situations (usually small samples or lots of covariates)
The only real distinction here is that logistic models assume a logistic error, and probit models assume a Gaussian error. The interpretation of the coefficients will be different, but once they’re converted to predicted probabilities and marginal effects they should return very similar results.
| | logit | probit |
|---|---|---|
| X | 1.964 | 1.988 |
| | [1.724, 2.221] | [1.770, 2.222] |
| Num.Obs. | 1000 | 1000 |
| BIC | 935.8 | 644.9 |

95% CI in brackets
The way we solve this is ultimately less important than how it changes interpretations. There’s a few key differences here when compared to OLS
| | model 1 | model 2 |
|---|---|---|
| X | 2.630 | 3.953 |
| | [2.319, 2.969] | [3.450, 4.514] |
| Z | | 1.937 |
| | | [1.633, 2.268] |
| Num.Obs. | 1000 | 1000 |
| BIC | 790.6 | 548.7 |

95% CI in brackets
modelsummary(list(
  "navco LPM" = navco_lpm,
  "navco logit" = navco_logit,
  "navco probit" = navco_probit
  ),
  coef_rename = TRUE,
  coef_omit = "Intercept",
  estimate = "{estimate}",
  statistic = c("conf.int"),
  conf_level = .95,
  note = "95% CI in brackets",
  gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.',
  output = 'kableExtra'
)

| | navco LPM | navco logit | navco probit |
|---|---|---|---|
| goal type [Regime Change] | 0.097 | 0.398 | 0.248 |
| | [−0.242, 0.436] | [−1.029, 1.999] | [−0.637, 1.171] |
| goal type [Secession] | −0.271 | −1.641 | −0.940 |
| | [−0.631, 0.090] | [−3.363, 0.149] | [−1.941, 0.080] |
| goal type [Self determination] | 0.146 | 0.594 | 0.371 |
| | [−0.215, 0.506] | [−0.919, 2.266] | [−0.570, 1.346] |
| Num.Obs. | 352 | 352 | 352 |
| R2 Adj. | 0.061 | | |
| BIC | 507.3 | 475.6 | 475.6 |

95% CI in brackets
For a logit, the coefficients are on the scale of \(\ln \frac{p}{1-p}\) for \(Y=1\). So exponentiating can give you some sense of the probabilities if you’re used to thinking in those terms. After exponentiation, the intercept is the odds of success when every predictor equals 0.
(Intercept) goal_typeRegime Change
0.6000000 1.4885496
goal_typeSecession goal_typeSelf determination
0.1937985 1.8115942
[1] 0.5731707
[1] 1.342857
And the exponentiated slope coefficient is an odds ratio: the multiplicative change in the odds of success for a one unit change in X, relative to the baseline.
(Intercept) goal_typeRegime Change
0.6000000 1.4885496
goal_typeSecession goal_typeSelf determination
0.1937985 1.8115942
[1] 0.2978723
# the odds ratio
orat_viol <- pr_viol / (1 - pr_viol)
# the relative odds ratio compared to the baseline
orat_viol / orat

[1] 0.3159252
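A simulated sketch of this interpretation (toy data, not the NAVCO model): with a single binary predictor, the exponentiated intercept reproduces the raw odds of success in the baseline group.

```r
set.seed(5)
x <- rbinom(2000, 1, .5)                 # a binary predictor
y <- rbinom(2000, 1, plogis(-.5 + x))    # a binary outcome
fit <- glm(y ~ x, family = "binomial")
p0 <- mean(y[x == 0])                    # baseline success proportion
exp(coef(fit)[1])                        # matches the baseline odds below
p0 / (1 - p0)                            # raw odds of success when x = 0
```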
(Probit coefficients don’t work this way, which is probably why they’re less popular)
Rather than trying to interpret logit/probit coefficients in their untransformed state, it’s almost always better to convert them to predicted probabilities and marginal effects at some meaningful levels of the predictors.
Predicted probability is just the model prediction for Y=1 at some level of the independent variables
Marginal effect is just the predicted effect of a one unit change in X on the predicted probability of Y
In a logit model, the predicted probability for some observation is just the inverse logistic of the predicted linear value of y. Or
\[ P = \frac{1}{1+e^{-(\beta_0+\beta_1X_1+\beta_2X_2 + \dots)}} \]
(less complicated than it looks.)
In plain English, you just take the same plug-in-numbers approach you take to get a prediction from OLS, and then do this:
(Intercept) goal_typeRegime Change
-0.5108256 0.3978022
goal_typeSecession goal_typeSelf determination
-1.6409366 0.5942072
[1] 0.5731707
[1] 0.5731707
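A toy sketch of the plug-in logic with simulated data: the by-hand inverse logistic of the linear predictor matches `predict(type = 'response')`.

```r
set.seed(6)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(x))
fit <- glm(y ~ x, family = "binomial")
b <- coef(fit)
p_hand <- 1 / (1 + exp(-(b[1] + b[2] * 1)))                        # by hand, at x = 1
p_pred <- predict(fit, newdata = data.frame(x = 1), type = "response")
unname(p_hand) - unname(p_pred)                                    # ~0
```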
To get the marginal effect from a given point, you just need to see what happens with a one unit change in X for some unit. So the marginal effect of moving from a non-violent campaign to a violent campaign is:
Fortunately, we can just get the predictions using the predict function with a slight modification (and this works for probit models as well)
To get a marginal effect for some other hypothetical value, we can use the newdata argument and create a new data set with specific values assigned to each predictor
navco_mod2 <- glm(SUCCESS ~ VIOL + SECESSION + percent_pop, data = navco, family = 'binomial')
# pr success for a non-violent secessionist movement that has 1% popular participation
case_0 <- predict(navco_mod2, type = 'response', newdata = data.frame("VIOL" = 0,
                                                                      "SECESSION" = 1,
                                                                      "percent_pop" = .01))
# pr success for a non-violent secessionist movement that has 2% participation
case_1 <- predict(navco_mod2, type = 'response', newdata = data.frame("VIOL" = 0,
                                                                      "SECESSION" = 1,
                                                                      "percent_pop" = .02))
case_1 - case_0

         1
0.06916748
Note that the marginal effect of a one unit increase in a variable is higher or lower depending on our location along the probability curve: as we approach probabilities of zero and one, the slope gets more shallow.
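In a logit, this slope has a closed form: the marginal effect at \(x\) is \(\beta \, p(x)(1-p(x))\), which peaks where \(p = .5\). A tiny base-R sketch:

```r
beta <- 1
me_at <- function(x) beta * plogis(beta * x) * (1 - plogis(beta * x))
me_at(0)   # 0.25, the steepest point of the curve
me_at(4)   # far shallower out in the tail
```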
Getting a standard error by hand is a bit of a pain here. We could bootstrap! However, most statistical software will default to using the faster delta method.
Designing Women star Delta Burke, who (I assume) invented the Delta method
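A sketch of the bootstrap alternative on simulated data (resample rows, refit, recompute the quantity of interest; the names here are illustrative, not from the NAVCO analysis):

```r
set.seed(7)
d <- data.frame(x = rnorm(400))
d$y <- rbinom(400, 1, plogis(d$x))
me_boot <- replicate(200, {
  rows <- sample(nrow(d), replace = TRUE)              # resample with replacement
  b <- coef(glm(y ~ x, family = "binomial", data = d[rows, ]))
  plogis(b[1] + b[2]) - plogis(b[1])                   # effect of moving x from 0 to 1
})
sd(me_boot)   # the bootstrap standard error
```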
The ggeffects package automates a lot of the process here:
We can get adjusted predictions and confidence intervals with one function:
And automatically generate a nice plot for the predicted results:
And once we have a series of predicted probabilities, we can easily estimate average marginal effects across those cases:
Alternatively, we can generate marginal effects using the marginaleffects package. By default, the avg_comparisons function will return the average marginal effect of a one unit change in each covariate with all other covariates held at observed values:
Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
SECESSION 1 - 0 -0.318 0.0595 -5.35 <0.001 23.4 -0.435 -0.2015
VIOL 1 - 0 -0.147 0.0517 -2.85 0.0044 7.8 -0.249 -0.0459
percent_pop +1 0.574 0.0240 23.89 <0.001 416.6 0.527 0.6209
Type: response
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
As long as the link function is the same, you can make valid model comparisons using the AIC/BIC values or the log-likelihoods. There are also some fit statistics that are meant to approximate \(R^2\) values, but there’s no consensus on what they are supposed to mean.
| | model 1 | model 2 |
|---|---|---|
| goal type [Regime Change] | 0.398 | |
| | [−1.029, 1.999] | |
| goal type [Secession] | −1.641 | |
| | [−3.363, 0.149] | |
| goal type [Self determination] | 0.594 | |
| | [−0.919, 2.266] | |
| VIOL | | −0.696 |
| | | [−1.175, −0.221] |
| SECESSION | | −1.908 |
| | | [−3.137, −0.956] |
| percent_pop | | 41.445 |
| | | [16.415, 73.714] |
| Num.Obs. | 352 | 352 |
| BIC | 475.6 | 434.9 |

95% CI in brackets
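A self-contained sketch of this kind of comparison on simulated data: two nested logits with the same link, compared on BIC and log-likelihood.

```r
set.seed(8)
d <- data.frame(x = rnorm(500), z = rnorm(500))
d$y <- rbinom(500, 1, plogis(d$x + d$z))
m1 <- glm(y ~ x,     data = d, family = "binomial")
m2 <- glm(y ~ x + z, data = d, family = "binomial")
BIC(m1); BIC(m2)          # lower is better; m2 should win here since z matters
logLik(m2) > logLik(m1)   # the larger nested model always fits at least as well
```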
Of course, if you’ve got the data and time, I think the best approach is to compare models on something empirical, like their ability to predict held-out data:
library(caret)
# leave one out cross-validation
train.control <- trainControl(method = "LOOCV")
# convert outcomes to factor variables:
navco <- navco |>
  mutate(outcome = factor(SUCCESS, labels = c("failure", "success")))
cv_0 <- train(outcome ~ VIOL , data=navco,
method = "glm",
family ='binomial',
trControl = train.control)
cv_1 <- train(outcome ~ VIOL + SECESSION + participation_ntile, data=navco,
method = "glm",
family ='binomial',
trControl = train.control)
cv_2 <- train(outcome ~ VIOL , data=navco,
method = "rpart",
trControl = train.control)
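After training, caret stores the resampled performance in each object’s `$results`, which is what makes the comparison direct. A self-contained toy version (5-fold CV rather than LOOCV, purely for speed; the data here are simulated, not NAVCO):

```r
library(caret)
set.seed(9)
d <- data.frame(x = rnorm(300))
d$outcome <- factor(ifelse(d$x + rlogis(300) > 0, "success", "failure"))
ctrl <- trainControl(method = "cv", number = 5)
cv_glm <- train(outcome ~ x, data = d, method = "glm",
                family = "binomial", trControl = ctrl)
cv_glm$results$Accuracy   # held-out accuracy, comparable across candidate models
```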